Practical Notes on the Python Beautiful Soup Library


pip install beautifulsoup4
pip install lxml # install the lxml parser

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

A Tag corresponds to a tag in the HTML. First, create a BeautifulSoup object from the HTML source.

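from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml') # html is a string holding the HTML source
head = soup.head # get the head tag
title = soup.title # get the first title tag
p = soup.p # get the first p tag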
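The attribute-style access above can be tried end to end with a small document. This is a minimal sketch: the sample HTML is made up for illustration, and it assumes the built-in 'html.parser' instead of lxml so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """<html><head><title>Demo page</title></head>
<body><p class="intro">First paragraph</p><p>Second paragraph</p></body></html>"""

# Assumption: 'html.parser' (standard library) instead of 'lxml'.
soup = BeautifulSoup(html, "html.parser")

print(soup.title)  # the <title> tag
print(soup.p)      # only the FIRST <p> tag in the document
print(soup.table)  # None, because the document has no <table>
```

Accessing a tag that does not exist returns None rather than raising, so chained access such as soup.table.tr would fail with an AttributeError.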

A tag has two important attributes: name and attrs.

name = soup.p.name
attrs = soup.p.attrs # all attributes of the tag, as a dictionary
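A short sketch of what name and attrs return, using a made-up one-line document and the built-in 'html.parser' (both are assumptions for illustration). Note that class is a multi-valued attribute, so Beautiful Soup stores it as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample tag, used only for illustration.
html = '<p id="first" class="intro lead">Hello</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.name)          # p
print(p.attrs)         # {'id': 'first', 'class': ['intro', 'lead']}
print(p["id"])         # first (dictionary-style access to one attribute)
print(p.get("title"))  # None (safe access to a missing attribute)
```

Using p.get() instead of p["..."] avoids a KeyError when the attribute may be absent.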

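To get the text inside a tag, use .string. Note that .string returns None when the tag has more than one child node. For example: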

title = soup.title
print(title.string)

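get_text() returns all the text inside a tag, including the text of its child tags, as a single str (Unicode) string. In practice, get_text() is usually the method to use for extracting a tag's content.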
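The difference between .string and get_text() shows up as soon as a tag contains nested markup. A minimal sketch, assuming a made-up snippet and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<div><p>Hello, <b>world</b>!</p><span>single</span></div>"
soup = BeautifulSoup(html, "html.parser")

p_string = soup.p.string        # <p> has three children, so this is None
p_text = soup.p.get_text()      # concatenates every string under <p>
span_string = soup.span.string  # <span> has exactly one string child

print(p_string)     # None
print(p_text)       # Hello, world!
print(span_string)  # single
```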

The BeautifulSoup object represents the whole document. Most of the time it can be treated as a special Tag, so you can read its type, name, and attributes.

print(type(soup))
print(soup.name)
print(soup.attrs)

A Comment object is a special type of NavigableString. When its content is printed, the comment markers are not included.

# <a href="https://docs.python.org/zh-cn/3/"  id="link1"><!-- Official Python documentation --></a>
a = soup.a
print(a.string) # prints the comment text without the <!-- and --> markers

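When .string prints the content, the comment markers are stripped, so a comment looks like ordinary text, which can cause unnecessary trouble. It is best to check the type before using the value: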

from bs4.element import Comment

if isinstance(soup.a.string, Comment):
    print(soup.a.string) # this string is a comment, not visible text
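One way to apply this check in practice is a small helper that returns a tag's string only when it is real text. The helper name visible_string and the sample HTML are hypothetical, introduced only for this sketch, and the built-in 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample markup: one link holds a comment, the other real text.
html = '<a id="link1"><!-- a comment --></a><a id="link2">real text</a>'
soup = BeautifulSoup(html, "html.parser")


def visible_string(tag):
    """Return tag.string as str, or None if it is missing or an HTML comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = {a["id"]: visible_string(a) for a in soup.find_all("a")}
print(results)  # {'link1': None, 'link2': 'real text'}
```

isinstance() is used instead of comparing type() directly so that subclasses are handled as well.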

A tag's .contents attribute returns its direct children as a list:

head = soup.head
print(head.contents)

A tag's .children attribute returns the same direct children as an iterator rather than a list:

head = soup.head
# iterate over it to get the child nodes
for item in head.children:
    print(item)
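The following sketch contrasts .contents and .children on a made-up list, assuming the built-in 'html.parser'. Both expose the same direct children; only the container type differs.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

print(ul.contents)                    # [<li>a</li>, <li>b</li>]
print(isinstance(ul.contents, list))  # True
print(isinstance(ul.children, list))  # False, it is an iterator

child_names = [child.name for child in ul.children]
print(child_names)                    # ['li', 'li']
```

Use .contents when you need indexing or len(), and .children when you only loop once.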

A tag's .descendants attribute iterates over all of its descendants: children, grandchildren, and so on.

head = soup.head
for item in head.descendants:
    print(item)
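To see how .descendants differs from .children, the sketch below counts both on a made-up nested snippet (built-in 'html.parser' assumed). Text nodes count as descendants too.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<div><p>Hi <b>there</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = list(div.children)       # only the <p> tag
all_nodes = list(div.descendants) # <p>, 'Hi ', <b>, 'there'

print(len(direct))     # 1
print(len(all_nodes))  # 4
for node in all_nodes:
    print(repr(node))
```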

The .strings attribute yields every string under a tag or document; iterate over it to get them.

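for s in soup.strings: # avoid naming the variable str, which shadows the built-in
    print(repr(s)) # repr() returns the interpreter-readable form, so whitespace is visible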

The strings produced by .strings include many spaces and blank lines. Use .stripped_strings to skip whitespace-only strings and strip the rest.

for s in soup.stripped_strings:
    print(repr(s))
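A side-by-side comparison makes the effect of .stripped_strings concrete. This sketch assumes a made-up, indented snippet and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with indentation and padding spaces.
html = """<div>
  <p> first </p>
  <p>second</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # includes newline/indent-only strings
clean = list(soup.stripped_strings)  # whitespace-only strings are dropped

print(raw)
print(clean)  # ['first', 'second']
```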

Use the .parent attribute to get an element's parent node.

p = soup.p
print(p.parent.name)

Use the .parents attribute to iterate over all of an element's ancestors, from its parent up to the document root.
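Walking .parents on a small made-up document shows the order of the ancestors (built-in 'html.parser' assumed). The last item is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<html><body><div><p>text</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
names = [parent.name for parent in p.parents]
print(names)  # ['div', 'body', 'html', '[document]']
```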

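The .next_sibling attribute returns the node's next sibling, and .previous_sibling returns the previous one; either returns None if no such node exists. Note: in real documents, a tag's .next_sibling or .previous_sibling is often a whitespace string, because whitespace and line breaks between tags are nodes too, so the result may be blank text or a newline rather than a tag.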
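The whitespace pitfall is easy to reproduce. In this sketch (made-up markup, built-in 'html.parser' assumed), .next_sibling returns a newline, while find_next_sibling() skips text nodes and returns the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the two links are separated by a newline.
html = """<div>
<a id="one">1</a>
<a id="two">2</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", id="one")
last = soup.find("a", id="two")

print(repr(first.next_sibling))       # '\n', a whitespace text node
print(first.find_next_sibling("a"))   # <a id="two">2</a>
print(last.find_next_sibling("a"))    # None, there is no later <a>
```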

Use the .next_siblings and .previous_siblings attributes to iterate over the siblings after or before the current node.

for sibling in soup.img.next_siblings:
    print(repr(sibling))

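The .next_element and .previous_element attributes differ from .next_sibling and .previous_sibling: they are not limited to sibling nodes but follow every node in parse order, regardless of nesting level.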

The .next_elements and .previous_elements iterators walk forward or backward through the parsed document.

for element in soup.img.next_elements:
    print(repr(element))
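The sketch below puts .next_sibling and .next_element side by side on a made-up snippet (built-in 'html.parser' assumed). The sibling is the next node at the same level, while the next element is whatever was parsed immediately afterwards, here the tag's own first string.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<div><p>one <b>bold</b></p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

print(repr(first_p.next_sibling))  # <p>two</p>, same nesting level
print(repr(first_p.next_element))  # 'one ', first node parsed inside <p>

following = [str(e) for e in first_p.next_elements]
print(following)
```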

find_all(name, attrs, recursive, string, limit, **kwargs)
# find all p tags
print(soup.find_all('p'))
# find tags whose attribute value matches the condition
print(soup.find_all(id='link1'))
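A few more find_all() filters, sketched on made-up markup with the built-in 'html.parser' (both assumptions for illustration). The keyword class_ is used because class is a reserved word in Python.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """<div>
<p class="python">A</p>
<p class="java">B</p>
<a id="link1" href="https://docs.python.org/zh-cn/3/">docs</a>
<a id="link2" href="https://example.com/">other</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")                               # by tag name
python_p = soup.find_all("p", class_="python")           # by CSS class
p_and_a = soup.find_all(["p", "a"])                      # any of several names
py_links = soup.find_all("a", href=re.compile(r"python\.org"))  # regex on attribute
first_a = soup.find_all("a", limit=1)                    # stop after one match
top_level_p = soup.find_all("p", recursive=False)        # direct children only
missing = soup.find("a", id="nope")                      # find() returns None

print(len(all_p), len(python_p), len(p_and_a), len(py_links))  # 2 1 4 1
print(first_a, top_level_p, missing)
```

find() is the single-result counterpart of find_all(): it returns the first match or None instead of a list.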

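select() searches with CSS selectors and returns a list of matching tags.

# find tags named title
print(soup.select('title'))
# find tags whose class is python
print(soup.select('.python'))
# find the tag whose id is link1
print(soup.select('#link1'))
# descendant selector: find the element with id link1 inside a p tag; the two parts are separated by a space
print(soup.select('p #link1'))
# find p tags whose class value is "python"
print(soup.select('p[class="python"]'))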
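The selectors above can be checked against a small made-up document (built-in 'html.parser' assumed). The sketch also shows select_one(), which returns the first match or None, and the difference between the descendant selector (space) and the child selector (>).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """<div id="main">
<p class="python">A <a id="link1" href="/a">x</a></p>
<p class="other">B</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("p.python")     # p tags with class python
descendant = soup.select("p #link1")   # #link1 anywhere inside a p
child = soup.select("p > a")           # a that is a direct child of p
not_child = soup.select("div > a")     # a is not a direct child of div
by_prefix = soup.select('a[href^="/"]')  # href starting with "/"
missing = soup.select_one("#missing")  # None when nothing matches

print(len(by_class), len(descendant), len(child), len(not_child), len(by_prefix))
print(missing)
```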
